Help biocurators to maximize the reach of your data

Curated scientific databases catalogue and amplify research findings to maximize their reach. Authors should write their papers with this in mind, ensuring that data are accurate, easy to extract, and presented in standardized formats.

annually to research impacts [3]. In the future, these resources are likely to increase further in value for sectors that utilize big data-dependent approaches.
However, to effectively empower such resources, the data within papers must be curatable; the data should be accurate, easy to extract, and presented in standardized formats. Unfortunately, this is not always the case, and all curators experience some consistent and longstanding problems across the biological literature that hinder curation. While the application of artificial intelligence will advance the field, the problems we identify below will likely persist. Although some of these issues have been discussed in the literature [4][5][6], many tend to be discussed informally within the curation community and do not reach the general biology community. The main problems include the following:
1. Not publishing the underlying data. This is the most obvious problem and the easiest to rectify. Summary tables and figures are presented, but the underlying data are often missing. While you can turn a carrot into cake, you cannot turn a cake into a carrot. It is relatively easy to turn text and numbers into a nice figure, but to turn that figure back into raw data is often impossible. This could be solved quite simply by publishing all the underlying data.
2. Inappropriate formatting. A restaurant would not serve you a photo of the meal you ordered. If you have a spreadsheet, why would you save it as an image file? Not being able to copy and paste data or to clearly read it decreases the possibility of curation. Again, this can be solved simply by paying attention to proper formatting.

3. Annotation and accessibility of data in external repositories. The use of repositories is often recommended by funders and journals, and there are sound reasons for this: They increase trust and confidence in the quality of data, help align them with the FAIR principles, and increase the number of citations. However, authors and reviewers need to consider the accessibility and presentation of any submitted data. Being publicly available in principle and in practice are often not the same thing, as Douglas Adams wrote in relation to an important piece of planning permission: "It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying 'Beware of the Leopard'" [7]. This is perhaps the biggest and most controversial problem for external repositories, given their popularity. Repositories are frequently inaccessible to curators due to access permissions, and there is a lack of sufficient control on the content and format of submitted data. Often, only raw data are included, particularly for genomic sequencing data, meaning that complex bioinformatic processing is required to recapitulate the summarized data in the original paper. This is often practically impossible for curators to do, as methodologies can be unclear or use bespoke and unavailable tools. There are also considerable risks that data are lost or made otherwise inaccessible when repositories undergo budgetary contractions and/or are retired [8][9][10]. If external repositories are used, then authors should ensure that processed data (such as vcf files) as well as raw data are uploaded and that access is not restricted or otherwise impaired.
4. Third-party services restricting data. This is a relatively new but growing problem. Have you ever bought a car only to find that you have to pay a lot more money simply to unlock some of its features? Some providers of sequencing services do not release all the data they generate back to researchers as standard; instead, researchers may only get a partial description of mutations, not the complete details. Communities, authors, and journals can solve this by establishing minimal datasets and standards, such as those that already underlie AGR resources, and we are pleased to note that such discussions are already happening elsewhere [11,12].

5. Accuracy. Occasional small errors in complex works are understandable, despite the best efforts of authors and reviewers to minimize these. Curators can help by correcting obvious mistakes. However, frequent small errors affect the quality of the work and will affect decisions to curate. Quality control tools could be developed for use prior to submission to help reduce this problem.
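To illustrate what a presubmission quality control check might look like, here is a minimal Python sketch. It is not an existing tool and is not taken from any curation resource: the function name `check_table`, its `numeric_columns` parameter, and the three checks it performs (consistent column counts, no empty cells, numeric values where expected) are all assumptions chosen for illustration.

```python
import csv
import io


def check_table(text, numeric_columns=()):
    """Return a list of problems found in tab-separated text (hypothetical checker)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["table is empty"]

    header = rows[0]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(rows[1:], start=2):
        # A row with the wrong number of cells usually signals a pasting error.
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} columns, found {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            if value.strip() == "":
                problems.append(f"line {line_no}: empty value in column '{name}'")
            elif name in numeric_columns:
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        f"line {line_no}: non-numeric value '{value}' in column '{name}'"
                    )
    return problems
```

Calling `check_table(text, numeric_columns=("count",))` on the contents of a supplementary table would return an empty list if none of these three problems is found, or one message per problem otherwise. A real tool would need checks agreed with the relevant curation community.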
All the above are potentially rectifiable, but this leads to the ultimate problem: When asked, too many authors do not respond to requests to share their data despite this being a condition of publication and/or funding. Even when publishers mandate data sharing, requests are often ignored [13,14]. This represents a serious threat to the ability of data resources to extract data, as well as to the credibility of research in general. How can we solve this problem?
Incentivizing the sharing of data requires the involvement of many stakeholders. Data resources could cite source publications in a way that counts towards a paper's total citations. Some large funding agencies already insist on postpublication data sharing upon request (e.g., the NIH and UK Research and Innovation), and this should be expanded and enforced. Institutions could regard the failure to share data by authors as a notifiable offence. Journals could encourage curatable formats and robustly enforce data sharing commitments. Ultimately, though, the responsibility will fall on authors as the creators and initial custodians of their data.
Modern scientific publishing can place requirements on authors that, while necessary, can be time consuming and complex to satisfy, and our suggestions will no doubt risk adding further complexity and frustration to the publication process. We are conscious of this and recognize that there are many different perspectives to consider other than our own. While it would be unreasonable to expect authors to write papers solely to our requirements, we think that the single most important thing any author can do is to place as much of their data as possible in simple plain text documents as supplemental data. If a summary table is presented in the main text, then the underlying data should be published as well. If a data table is presented, it should be available as a spreadsheet, not (just) as an image, PDF, or other nonextractable format. Making it pretty or excluding data for the sake of layout is not important; curators just want to curate your papers as best we can and for your benefit. By including all your data in simple formats, you make your paper curatable and you make it easy for us to promote and amplify your data, and who would not want that?
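As one illustration of what "simple plain text" can mean in practice, the Python sketch below writes a small table to a tab-separated file using only the standard library, then reads it back. The column names, values, and function names are hypothetical placeholders, not data or tools from any paper; the point is only that a table saved this way can be copied, parsed, and curated without manual transcription.

```python
import csv

# Hypothetical example rows; real column names and values would come from the study.
ROWS = [
    {"gene": "geneA", "allele": "a1", "phenotype": "short roots", "n": 12},
    {"gene": "geneB", "allele": "b7", "phenotype": "late flowering", "n": 9},
]


def write_tsv(rows, path):
    """Write a list of dicts to a tab-separated plain-text file with a header row."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def read_tsv(path):
    """Read the file back as a list of dicts; every value is returned as a string."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

A file produced by `write_tsv(ROWS, "table_s1.tsv")` (the file name is also just an example) opens in any spreadsheet program or text editor, which is not true of the same table saved as an image.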